Introduction. The biological problem

  • In this class we will be moving into the discussion of problems of similarity in sequence analysis. Two basic questions that we will deal with are:
  1. Given two sequences of comparable length, what is the best way to objectively define a measure of their of their similarity?

  2. Given two sequences with significantly different lengths, how can we identify all identical matches of the shorter within the longer one?

Part A. Objectively defining the similarity between two sequences

Biological Context

This is one of the most primary problems of bioinformatics with its roots in the study of: a. Phylogenetic relationships at the molecular level, e.g. the comparison of orthologous sequences b. Evolutionary dynamics in genomes, e.g. the identification of gene duplications, rearrangements etc c. Genomic variability, e.g. comparisons of the same sequence among different individuals

Sequence Similarity. The problem

  • How similar are two sequences?
  • Let’s start by looking into a very simple example of two 10-nucleotide sequences
G G G A A T T T C C
G G C A A T T T C C

It is obvious that they basically differ in only one nucleotide. The question is how we can quantify this?

Measures of sequence distance

In Computer Science this is a problem of “character string comparison” (or “string comparison”). There are many measures for the quantification of such similarities, the simplest of which is called “Edit Distance” or “Hamming Distance”. We can define it as following:

  • Edit distance is the number of residues we need to edit/substitute to obtain one sequence from the other, without re-arrangements, deletions or insertions.

Edit Distance of two Sequences

In the figure you see that between the two sequences there are 9 identical residues, meaning only one change is needed. We say that they have \(Distance=1\). This can be scaled for the total size of the combined sequences \(Distance=1/10=0.1\)

In a similar way we can quantify the distance of the following pair of sequences as \(Distance = 3/10 = 0.3\)

Distance 2

Going beyond distance measures

What if the sequences are not so similar? In the example we see below the distance is calculated as \(D=9/10=0.9\). This means that the two sequences are only 10% similar to each other?

Are they though?

If one “slides” the two sequences against each other we can obtain a higher similarity as residues become “aligned”

The sequences now have a much lower distance than 0.9, even allowing for the fact that the combined length of the two sequences is 12 nucleotides \(D=4/12=0.33\).

The main point here is that we have allowed ourselves the additional liberty of displacing the two sequence against each other, without changing the order of residues in either of the two

The Sequence Alignment Problem

The highlighted text above is a good definition of Sequence Alignment.

This is:

  • Assuming two sequences, which is the best way we can “slide” one against the other, without altering the order of residues, in order to maximize their similarity?

In the example shown below, we see that two sequences can be “aligned” with more than one ways.

Both of the following sequences give rise to 5 “matching” residues, but in the first there 4 more “slides” required to open a gap in the middle of the second sequence. Do we prefer the first or the second? Should we take into account the “matching” residues or also the number of “gaps” in the alignment?

The question that arises is then related to that best way we mentioned earlier. The problem is now reduced to the following:

  • Given two sequences, how can we define which of their possible alignments provides the highest similarity score provided an evaluation of matches, mismatches and gaps

We will approach this through …

…A not so biological problem

We will use an analogy for the “optimal pairwise alignment problem” in which:

  • My friend V. wants to take a sailing trip in the Aegean.

  • He checks the weather: N/NW winds throughout the whole period mean he will have to set a course S/SE.

  • Moving S/SE and starting from Rafina (START), he needs to find the best route in order to reache Astypalaia

  • His choice of route will depend on a grading of possible destinations based on the places he wants to visit the most.

  • We will start by making a grid which we will use to grade each destinations.

  • We will also note the permissible map transitions, which can only be: a) left-to-right horizontally b) top-to-bottom vertically and c) topleft-to-bottomright diagonally

Mapping a route

Best possible route. Take 1

  • Move from START to END making the best choice at each step

Best possible route

This itinerary gives a total score of 23, which is the sum of the scores as V. goes through the marked squares in the matrix map.

The question is:

Is there a better way?

The approach described above is what we call a “greedy” approach in Computer Science, that is simply going for the best case in every step without taking into account alternatives we may be disregarding once committed to a choice.

In the example shown above, this is seen if you consider that once V. has committed himself to reaching Mykonos, he cannot go back to any of the squares in the first two columns of the map (some of which carry formidable scores). This is because of the restriction of only making left-right, top-bottom displacements.

The “best” possible way by mean of “all ways”

What if we scored all the possible routes and then just chose the one with the highest scores?
This is what computer scientists would call a “brute force” or “exhaustive” approach. Even though it is by definition correct, it is impractical in most of the cases. If you can’t think why, try figuring out the number of possible itineraries for this simple 5x6 matrix. Even for such a small matrix the number is in the few hundreds (!!!).

We thus have to find a better way.

Dynamic Programming for the “Sailor Problem”

We will now introduce an alternative strategy that guarantees the best solution without exhaustively searching all possible trajectories in the matrix.

The main idea behind the dynamic programming algorithm is to solve a bigger problem by iteratively solving many smaller (and for that easier to solve) subproblems.

In our case, the subproblems will consist of identifying the best possible way to reach every square/position in the matrix.

Once we have done this we can easily reconstruct the optimal itinerary by tracing the optimal solutions from each square.

Let’s see how this is done step-by-step

Step 1. Fill in the first row and the first column

This is easy since for these squares we have no alternatives. We can only reach squares of the first row with horizontal displacements and those of the first column with vertical.

We can thus score the part of the matrix shown below:

Alignment Step #1

Step 2. Score a 3-to-1 square case

With the first row and column filled we now have one square in the matrix that we can “solve”. This square is the one lying at the top-left corner of the “unsolved” part of the map (in this case it is the island of Tinos) and we can solve it if we ask the question, which is the best way to reach it?

This is now solvable because we have all the available information. Consider that in order to solve a square we need to have the scores from all three surrounding squares (top, left and top-left). Thus for the square \(M[i,j]\), we need the scores of \(M[i-1,j]\), \(M[i, j-1]\) and \(M[i-1,j-1]\). Then we simply have to calculate which of the three sums of those scores with \(M[i,j]\) is the maximum. \(M[i,j]\) will now become: \[M[i,j]=max(M[i-1,j]+M[i,j], M[i, j-1]+M[i,j], M[i-1,j-1]+M[i,j])\]

Once we identify the maximum of the three sums we keep it as new score and we take note of the direction from which it was obtained. We call this the “trace” of the square and it is important to help us draw the route.

This is shown below for the case of Tinos. The best way to reach Tinos is by way of Andros approaching from the north (top-bottom displacement)

Alignment Step #2

Step 3. Iteratively score all squares

Consider what this solution allows us to do. We have just solved the subproblem of the optimal root for the 2X2 top-left subsquare of the matrix. This means that if our question was “Which is the best way to reach Tinos?” we would know the solution by tracing backwards from Tinos: a. The best way to reach Tinos is through Andros b. The best/only way to reach Andros is from Start c. Thus the root is START->Andros->Tinos with a total score of M[i,j]

We can use the same strategy used for Tinos for every square for which we have score the 3 surrounding squares. Thus as we work our way from top-left to bottom-right we can iteratively score the complete matrix.

Alignment Step #3

Step 4. Backtracking

Once all squares have been scored, all we need to do is look at the last (bottom-right) corner of the matrix. The score it contains is the total score of the route.

Compare the score in Astypalaia with Dynamic Programming (31) with the one obtained with the greedy approach (23).

For the complete route we now have to “backtrack” going back from the last square and following the traces (arrows) that we have marked as the best origin for each square.

So, this is the route we have identified based on the scores that V. gave us.

Backtracking for the best route

Back to Sequence Alignment

Nice, but how does this help us to align sequences?

It is basically the same problem. Here are the analogies. 1. Aligning to sequences is a process of reading both sequences from beginning to end. If we place one sequence on the rows of a matrix and the other on its columns, then what we need to do is to go from the top-left to the bottom-right corner of the \(S_1XS_2\) matrix. 2. We are not allowed to change the order of the residues or re-read parts of any sequence. This means that we can only read either just one (left-to-right), just the other (top-to-bottom) or both sequences at the same time (diagonally topleft-to-bottomright). 3. We have a system to score our displacements on this matrix. This will be based on whether the comparison is favourable. In the simplest case we just need three scores for the three possible outcomes: matching residues, not-matching residues and gap (match/mismatch/gap)

In the following figure you see two alternative alignments using the dynamic programming algorithm we described above (it is called Needleman-Wunsch by the two scientists who invented it).

The difference between the two alignments is the scoring scheme. In the left we use a Match=1, Mismatch=-1, Gap=-1 scoring table in the second a Match=1, Mismatch=-2, Gap=-1.

Notice the track of the alignment shown in red. When mismatches and gaps are penalized equally, the displacements occur largely through the diagonal. When mismatches are penalized more the diagonal is avoided in the region where there is no similarity and horizontal-vertical (gap) displacements are more prominent.

X

Protein Sequence Alignments use more detailed score matrices

Scoring schemes are thus very important. They largely shape both the alignment and the final score. This is why, especially when comparing protein sequences, we avoid using such simplified schemes of three scores. One has to consider the physical properties of the aminoacids.

The similarities and differences between aminoacids mean that we should not consider the substitution (mismatch) of an aspartate by a glutamate to be the same as an aspartate being substituted by lysine. In the same way the maintenance (match) of a valine is not as important as that of a tryptophane, which is a rare, essential aminoacid. To account for all these relationships we use pre-calculated matrices, which we employ as “look-up” tables every time we move diagonally in an alignment matrix.

Below you can see one such matrix, the BLOSUM50 created from the comparison of a large number of similar protein alignments.

The BLOSUM50 substitution matrix

Part B. Going a bit faster

The Problem with the alignment approach

It takes approximately (NxM) calculations to align a sequence of lenght N to a sequence of length M. If we want to align a lot of sequences we need to be able to do it faster, especially if we just want to find their best match in a much larger sequence (e.g. a complete genome)

Big Data Alignment Solutions. BLAST

BLAST is an approximate approach that filters out most of the sequence targets before going further. BLAST searches for sequence similarity of a “query” sequence against a “target” database. You may see a read as a “query” and the reference genome as the “target”

BLAST

BLAT (BLAST-like tool)

BLAT is a BLAST-like approach for quick searches in a particular genome sequence instead of entire sequence databases. It was quite popular before the advent of NGS approaches which have posed a real bottleneck even for the fastest comparison algorithms existing at the time.

BLAT, a BLAST-like tool

Thus even the fastest alignment and rapid-search approaches are not fast enough for NGS data. In the following we will see why we have to resolve to data-transformation approaches to tackle the read alignment problem.

Part C. NGS Approaches to Fast String Matching

We need something stronger

Up to now the approaches we have discussed look at either one-vs-one exact sequence alignments, or fast approaches for approximate matching of one vs many sequences.

In NGS data we a slightly different task from a qualitative point of view. This is the exact match of many-to-one sequences. More importantly, it is also quite different from a quantitative point of view. A typical NGS experiment creates ~100Mi sequence reads of ~200bp that have to be matched against a genome of ~3Gbp. This is impossible to carry out getting one read at a time and matching it against the reference genome. It is the equivalent of having a highway that extends for hundreds of kilometers and allowing only one car to use it every time.

Data Transformation Approaches

In order to be able to use the “highway” more effectively we have to transform the highway so that it is more easily accessible. The solution for the NGS “read mapping problem” is thus based on manipulating the reference genome sequence.

Data transformation methods, create an easily searchable/mappable version of the target sequence (the Reference Genome). Although memory-demanding they make the process much faster and allow the repetitive sequence search at great speed and efficiency. Examples of such transformations are the Suffix trees and the Burrows-Wheeler Transform (BWT).

Below we will discuss the first of the two.

Suffix Trees

Suffix Trees (ST) are an efficient way to create a rapidly “searchable” representation of a long sequence.

Imagine the sequence \(S=GAGTAAGTCA\). We will create its ST, starting with the creation of all the suffixes(=endings) of the sequence. In order to create its suffixes (as dictated by the “suffix”) we simply append a $ sign at the end of each possible suffix of \(S\).

Suffixes S
GAGTAAGTCA$
AGTAAGTCA$
GTAAGTCA$
TAAGTCA$
AAGTCA$
AGTCA$
GTCA$
TCA$
CA$
A$

We then order all the suffixes alphabetically

Suffixes So
A$
AAGTCA$
AGTAAGTCA$
AGTCA$
CA$
GAGTAAGTCA$
GTAAGTCA$
GTCA$
TAAGTCA$
TCA$

At the same time we keep the positions of each suffix in the original sequence

Suffixes
A$ [10]
AAGTCA$ [5]
AGTAAGTCA$ [2]
AGTCA$ [6]
CA$ [9]
GAGTAAGTCA$ [1]
GTAAGTCA$ [3]
GTCA$ [7]
TAAGTCA$ [4]
TCA$ [8]

Building a Suffix Tree

A Suffix Trie looks, well, like a tree. You start with a root and then proceed in iteration by: 1. adding each suffix to the root given that its prefix doesn’t already exist or 2. adding each suffix to an already existing prefix if it’s suffix is matching it

Suffix Tree Building

Once we go through the whole list of suffixes the tree can be compressed in order to have all non-branching clades in “leaves”. Even so, compare the size of the tree with the original 10-nucleotide sequence.

Suffix Tree Building

Using a Suffix Tree

Why did we bother to use up so much memory and time to create the ST? It is because the tree has some specific properties that allow us to locate identical matches in an ultra-fast and efficient way.

The basic property of the ST is that every substring of \(S\) is a prefix of some suffix of \(S\). This means that every sequence contained in \(S\) will be connected to the root and thus we can tell if a shorter sequence exists in the main sequence just by looking at the tree.

We can actually do much more like:

  • Decide if a pattern exists in a sequence
  • Find a pattern in a sequence
  • Find how many times a pattern occurs in a sequence
  • Find the position(s) of a patten in a sequence
  • Find the longest repeat in a sequence
  • Find the longest common pattern between two sequences
  • etc.

Suffix Trees Games

We can check the existence of a subsequence starting from the root and working our way down the tree. If we manage to read all the subsequence inside the tree, then by definition is is part of the longer, main, sequence.

In the figure below we check if \(AGT\) is contained in the orignal sequence.

It turns out that it does (remember the original sequence was \(S=GAGTAAGTCA\)).

Finding repeats in a sequence

Even better the ST can help us identify how many times the sequence exists in the main sequence. This is given directly by the number of branches coming out of the clade that we have followed in our search. In the figure above \(AGT\) branches out into two leaves, which means that it is contained twice in the original sequence (again remember it was \(S=GAGTAAGTCA\)).

Last but not least we can identify the positions in the original sequence. Remember that we have kept those for every suffix. We have not drawn this in the tree but if we go back to the original list:

Suffixes
A$ [10]
AAGTCA$ [5]
AGTAAGTCA$ [2]
AGTCA$ [6]
CA$ [9]
GAGTAAGTCA$ [1]
GTAAGTCA$ [3]
GTCA$ [7]
TAAGTCA$ [4]
TCA$ [8]

We can see that the branches starting with \(AGT\) are the ones at positions 2 and 6. Indeed these are the positions of \(AGT\) in the sequence:

1 2 3 4 5 6 7 8 9 10
G A G T A A G T C A
  A G T   A G T

Suffix Trees were used for some of the first ultra-fast NGS mapping software. Even though they are now largely replaced by Burrows-Wheeler Transform based approaches the concept of data transformation is very similar (and similarly ground-breaking).

Assignment

Introduction

The goal of this assignment will be to combine sequence similarity analysis with the clustering approaches we discussed last time.

Histone H4 is one of the most well-preserved proteins among eukaryotic organisms. In this assignment, you will have to assess the degree of conservation through Needleman-Wunsch pair-wise alignment in specific protein sequences derived from a broad group of eukaryotes.

The file you will find here contains the H4 protein sequences from the following organisms: (Homo sapiens: HUMAN, Mus musculus: MOUSE, Drosophila melanogaster: DROME, Arabidopsis thaliana: ARATH, Bos taurus: BOVIN, Caenorhabditis elegans: CAEEL : CHICK, Rattus norvegicus: RAT, Saccharomyces cerevisiae: YEAST, Xenopus laevis: XENLA)

Stage 1

You are asked to:

  1. Align all sequences with each other in pairs using the BLOSUM50 replacement table.
  2. Indicate the alignment scores on a square table of \(NXN\) dimensions

Auxiliary information: It is recommended that you use the the R::seqinr and BioConductor::Biostrings libraries. The installation of the former is done directly by R

install.packages("seqinr")

while for Biostrings you have to go through Bioconductor:

BiocManager::install("Biostrings")

The reading of the sequences can be done with the \(read.fasta()\) function of seqinr. Alignment will be performed with the Biostrings \(pairwiseAlignment()\) function using the substitutionMatrix = “BLOSUM50” parameter.

Stage 2

In this second stage you will use the data from the previous query. The purpose here is to estimate the evolutionary distances between the sequences based on pairwise alignment. You are required to:

  1. Convert the similarity table from Stage 1 to a distance table. You can do this by taking the maximum similarity max(S) of the whole table and applying the formula \(D_i=(1-S_i)/Max(S)\) to all \(S_i\) values.
    2.Use the distance \(D\) table as input to create a UPGMA tree. You can use the \(hclust()\) function of the basic R.

  2. Return the tree and comment if it has an image that reflects the complexity relationships between the organisms